
Week 9 - Tree-Based Methods¶

Dr. David Elliott

  1. Introduction

  2. General Decision Tree Algorithm

  3. Specific Decision Tree Algorithms

1. Introduction ¶

Tree-based methods stratify or segment the predictor space into a number of simple regions1.

As the splitting rules used to make these decision regions can be summarised in a tree structure, these approaches are called decision trees.

A decision tree can be thought of as breaking data down by asking a series of questions in order to categorise samples into the same class.

Terminology5¶

Root node: no incoming edges and zero or more outgoing edges.

Internal node: one incoming edge and two (or more) outgoing edges.

Leaf node: each leaf node is assigned a class label if nodes are pure; otherwise, the class label is determined by majority vote.

Parent and child nodes: If a node is split, we refer to that given node as the parent node, and the resulting nodes are called child nodes.

Notes

  • Trees are typically drawn upside down, so the leaves are at the bottom of the tree
Figure 1: Categorical Decision Tree

Dataset Example: Penguins ¶

The "palmer penguins" dataset2 contains data for 344 penguins from 3 different species and from 3 islands in the Palmer Archipelago, Antarctica.


Artwork by @allison_horst

Figure 2: Penguin Buddies
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |

2. General Decision Tree Algorithm ¶

An algorithm starts at the tree root and then splits the data on the feature that gives the best split according to a splitting criterion.

Generally, this splitting procedure occurs until3,4...

  • ...all the samples within a given node all belong to the same class,
  • ...the maximum depth of the tree is reached,
  • ...a split cannot be found that improves the model.

NOTES

  • The process of growing a decision tree can be expressed as a recursive algorithm as follows5:
    1. Pick the feature whose split of the parent node results in the largest information gain, stopping if the information gain is not positive.
    2. Stop if child nodes are pure or no improvement in class purity can be made.
    3. Go back to step 1 for each of the two child nodes.
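The recursive procedure above can be sketched in plain Python. This is a simplified illustration with our own helper names (`gini`, `best_split`, `grow_tree`), not Scikit-learn's implementation; Gini impurity stands in for the splitting criterion:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return the (feature, threshold, gain) maximising information gain."""
    best = (None, None, 0.0)
    parent_impurity = gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue  # this threshold puts everything on one side
            gain = parent_impurity - (len(left) * gini(left)
                                      + len(right) * gini(right)) / len(y)
            if gain > best[2]:
                best = (f, t, gain)
    return best

def grow_tree(X, y):
    """Recursively split until nodes are pure or no split improves purity."""
    f, t, gain = best_split(X, y)
    if gain <= 0:  # stop: return a majority-vote leaf
        return Counter(y).most_common(1)[0][0]
    left = [(row, lab) for row, lab in zip(X, y) if row[f] <= t]
    right = [(row, lab) for row, lab in zip(X, y) if row[f] > t]
    return {"feature": f, "threshold": t,
            "left": grow_tree(*zip(*left)),
            "right": grow_tree(*zip(*right))}

tree = grow_tree([[1.0], [2.0], [10.0], [11.0]], ["a", "a", "b", "b"])
print(tree)  # one split at threshold 2.0 with pure "a"/"b" leaves
```

Real implementations add the stopping criteria listed above (maximum depth, minimum samples per node) and avoid this quadratic scan over candidate thresholds.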

Below is an example of a very shallow decision tree where we have set max_depth = 1.

Terminology (Reminder)1

  • The regions $R_1$ and $R_2$ are known as terminal nodes or leaves of the tree.
  • The points where the predictor space is split are known as the internal nodes. In this case as we only have one split this is the "root node".
  • The segments of the trees that connect the nodes are branches.

We can make the tree "deeper", and therefore more complex, by setting the max_depth = 3.

We could also use more than 2 features as seen below.

NOTES

  • With more than two features, the decision boundaries become harder to plot (see dtreeviz).

Notes

  • There is an example of a 3D decision region on pg. 547 in Murphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT press.

We could also easily extend this to have more than 2 (binary) class labels.

Estimating Class Probabilities3¶

We can estimate the probability that an instance belongs to a particular class easily.

For our new observation we...

  1. traverse the tree to find the leaf node the instance is assigned to,
  2. return the ratio of training instances of class $c$ in that node,
  3. assign our observation to the class with the highest probability.
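In Scikit-learn, these three steps are what `predict_proba` and `predict` do for a fitted tree. A minimal sketch, using the iris dataset as a stand-in since it ships with scikit-learn:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Ratio of training instances of each class in the leaf node the
# observation traverses to...
proba = clf.predict_proba(X[:1])
# ...and the class with the highest of those probabilities.
pred = clf.predict(X[:1])
print(proba, pred)
```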

Example

Using the model in figure 7, we could find the probability of the following penguins species:

| | bill_length_mm | flipper_length_mm |
|---|---|---|
| 65 | 41.6 | 192.0 |

| | Adelie | Gentoo | Chinstrap |
|---|---|---|---|
| 0 | 0.0 | 0.0204 | 0.9796 |

Predicted Species is Chinstrap

In a general sense this approach is pretty simple; however, there are a number of design choices and considerations we have to make, including5:

  • How do we decide which features to select for splitting a parent node into child nodes?
  • How do we decide where to split features?
  • When do we stop growing a tree?
  • How do we make predictions if no attributes exist to perfectly separate non-pure nodes further?

3. Specific Decision Tree Algorithms ¶

Most decision tree algorithms address the following implementation choices differently5:

  • Splitting Criterion: Information gain, statistical tests, objective function, etc.
  • Number of Splits: Binary or multi-way.
  • Variables: Discrete vs. Continuous.
  • Pruning: Pre- vs. Post-pruning.

There are a number of decision tree algorithms; prominent ones include:

  • Iterative Dichotomizer 3 (ID3)
  • C4.5
  • CART

Notes

  • Binary means nodes always have two children.

CART¶

Scikit-Learn uses an optimised version of the Classification And Regression Tree (CART) algorithm.

  • Splitting Criterion: Information gain
  • Number of Splits: Binary
  • Independent Variables (Features): Continuous
  • Dependent variable: Continuous or Categorical
  • Pruning: Pre- & Post-pruning

Notes

  • "scikit-learn uses an optimised version of the CART algorithm; however, scikit-learn implementation does not support categorical variables for now." https://scikit-learn.org/stable/modules/tree.html

Information Gain4¶

An algorithm starts at a tree root and then splits the data based on the feature, $f$, that gives the largest information gain, $IG$.

Splitting on information gain relies on calculating the difference between the impurity of a parent node, $D_p$, and the impurities of its child nodes, $D_j$; the information gain is high when the combined impurity of the child nodes is low.

We can maximise the information gain at each split using,

$$IG(D_p,f) = I(D_p)-\sum^m_{j=1}\frac{N_j}{N_p}I(D_j),$$

where $I$ is our impurity measure, $N_p$ is the total number of samples at the parent node, and $N_j$ is the number of samples in the $j$th child node.

Some algorithms, such as Scikit-learn's implementation of CART, reduce the potential search space by implementing binary trees:

$$IG(D_p,f) = I(D_p) - \frac{N_\text{left}}{N_p}I(D_\text{left})-\frac{N_\text{right}}{N_p}I(D_\text{right}).$$
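The binary form translates directly into code. A small sketch (our own helper names), using Gini impurity for $I$:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(impurity, parent, left, right):
    """IG(D_p, f) = I(D_p) - N_left/N_p I(D_left) - N_right/N_p I(D_right)."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

# A perfect split of a perfectly mixed parent recovers the full impurity:
print(information_gain(gini, [0, 0, 1, 1], [0, 0], [1, 1]))  # 0.5
```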

Notes

  • The CART algorithm is greedy - meaning it searches for the optimum split at each level. It does not check if this is the best split to improve impurity further down the tree3.
  • To find the optimal tree is known as an NP-Complete problem, meaning it is intractable even for small training sets3.

Three impurity measures that are commonly used in binary decision trees are the classification error ($I_E$), gini impurity ($I_G$), and entropy ($I_H$)4.

Classification Error4¶

This is simply the fraction of the training observations in a region that do not belong to the most common class:

$$I_E = 1 - \max\left\{p(i|t)\right\}$$

Here $p(i|t)$ is the proportion of the samples at node $t$ that belong to class $i$, out of $c$ classes.

Notes

  • Another way of writing this is $E = 1 - \max_k(\hat p_{mk})$, where $\hat p_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class1.
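As a quick sketch (the function name is ours), classification error computed from class labels:

```python
from collections import Counter

def classification_error(labels):
    """I_E = 1 - max p(i|t): the fraction outside the most common class."""
    n = len(labels)
    return 1 - max(count / n for count in Counter(labels).values())

print(classification_error(["a", "a", "a", "b"]))  # 0.25
print(classification_error(["a", "a"]))            # 0.0
```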

Entropy Impurity4¶

For all non-empty classes ($p(i|t) \neq 0$), entropy is given by

$$I_H=-\sum^c_{i=1}p(i|t)\log_2p(i|t).$$

The entropy is therefore 0 if all samples at the node belong to the same class and maximal if we have a uniform class distribution.

For example in binary classification ($c=2$):

  • $I_H=0 \text{ if } p(i=1|t)=1 \text{ or } p(i=1|t)=0$
  • $I_H=1 \text{ if } p(i=1|t)=p(i=0|t)=0.5$

Notes

  • In binary classification, entropy reaches its minimum (0) when all examples in a given node have the same label; on the other hand, the entropy is at its maximum of 1 when exactly one-half of the examples in a node are labeled with 1, which would be useless for classification.6
  • Another way of writing this is1 $D = -\sum^K_{k=1}\hat p_{mk}\log \hat p_{mk}$.
    • Entropy will take on a value near 0 if the $\hat p_{mk}$'s are all near 0 or 1, therefore will take on a small value if the $m$th node is pure.
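A minimal sketch of the entropy calculation (the function name is ours), checking the two binary extremes above:

```python
import math

def entropy(probs):
    """I_H = -sum p(i|t) * log2 p(i|t), over non-empty classes only."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0]))       # 0.0: all samples in one class
print(entropy([0.5, 0.5]))  # 1.0: uniform binary class distribution
```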

Gini Impurity4¶

Gini impurity is an alternative measure, which minimises the probability of misclassification,

$$ \begin{align} I_G(t) &= \sum^c_{i=1}p(i|t)(1-p(i|t)) \\ &= 1-\sum^c_{i=1}p(i|t)^2. \end{align} $$

This measure is also maximal when classes are perfectly mixed (e.g. $c=2$):

$$ \begin{align} I_G(t) &= 1 - \sum^c_{i=1}0.5^2 = 0.5. \end{align} $$

Notes

  • It is also a measure of node "purity" as a small value indicates a node contains predominantly observations from a single class.
  • Another way of writing this is1 for $K$ classes, $G = \sum^K_{k=1}\hat p_{mk}(1-\hat p_{mk})$.
    • It takes a small value if all of the $\hat p_{mk}$'s are close to 0 or 1.
  • Whether we use entropy or Gini impurity generally does not really matter, because both have the same concave/bell shape.
    • Gini is computationally more efficient to compute than entropy due to the lack of the log.
    • When they do differ, Gini is more likely to isolate the most frequent class in its own branch, while entropy produces slightly more balanced trees3.
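And the corresponding sketch for Gini impurity, here taking the class proportions directly (the function name is ours):

```python
def gini_impurity(probs):
    """I_G = 1 - sum p(i|t)^2."""
    return 1 - sum(p ** 2 for p in probs)

print(gini_impurity([0.5, 0.5]))  # 0.5: perfectly mixed binary node
print(gini_impurity([1.0, 0.0]))  # 0.0: pure node
```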

Feature Importance10,11¶

Decision trees allow us to assess the importance of each feature for classifying the data,

$$ fi_j = \frac{\sum_{t \in s} ni_t}{\sum^m_{t=1} ni_t} $$

where $ni_t$ is the importance of node $t$, and $s$ is the set of nodes that split on feature $j$.

We often assess the normalized total reduction of the criterion (e.g. Gini) brought by that feature,

$$ normfi_j = \frac{fi_j}{\sum^p_{j=1} fi_j}. $$
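Scikit-learn exposes this normalised quantity directly as the fitted attribute `feature_importances_`; a sketch on the iris dataset as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Normalised total reduction of the splitting criterion (Gini by default)
# attributed to each feature; the values sum to 1.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```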

Pruning¶

Question: When do we stop growing a tree?

Occam’s razor: Favor a simpler hypothesis, because a simpler hypothesis that fits the data equally well is more likely or plausible than a complex one5.

To minimize overfitting, we can either set limits on the trees before building them (pre-pruning), or reduce the tree by removing branches that do not significantly contribute (post-pruning).

NOTES

  • In other words, if decision trees are not pruned, they have a high risk of overfitting to the training data5.

Dataset Example: Breast Cancer Wisconsin Dataset12¶

The features were computed from a digitized image of a fine needle aspirate of a breast mass, and describe characteristics of the cell nuclei present in the image.

The dataset was created from digitized images of healthy (benign) and cancerous (malignant) tissues.


Image from Levenson et al. (2015), PLOS ONE, doi:10.1371/journal.pone.0141357.

Notes

  • Fine needle aspiration is a type of biopsy procedure.

Pre-Pruning¶

An a priori limit on nodes, or tree depth, is often set to avoid overfitting due to a deep tree4,5.

Notes

  • Another option is to stop growing if a split is not statistically significant (e.g., a $\chi^2$ test)5. However, this is not yet available in sklearn, although I'm sure you can find some code for it somewhere on the internet.

We could also set a minimum number of data points for each node5.
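In Scikit-learn, these pre-pruning limits are constructor parameters such as `max_depth` and `min_samples_leaf`; a sketch on the breast-cancer data (the split and the limit values are chosen arbitrarily for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: cap the tree depth and require at least 5 samples per leaf.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             random_state=0).fit(X_train, y_train)
print(clf.get_depth(), clf.score(X_test, y_test))
```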

Post-Pruning¶

In general, post-pruning consists of going back through the tree once it has been created, removing branches that do not significantly contribute to the error reduction, and replacing them with leaf nodes6.

Two common approaches are reduced-error pruning and cost-complexity pruning:

  • Reduced-error pruning5
    • Greedily remove nodes based on validation set performance
    • Generally improves performance but can be problematic for limited data set sizes.
  • Cost-complexity pruning5
    • Recursively finds the node with the “weakest link”.
    • Nodes are characterized by $\alpha \geq 0$, and nodes with the smallest effective $\alpha$ are pruned first7.
    • The trees are then defined as $I + \alpha|N|$, where $I$ is an impurity measure, such as the total misclassification rate of the terminal nodes, $\alpha$ is a tuning parameter, and $|N|$ is the total number of nodes5.

Notes

  • An early "bad" split may lead to a good split later, therefore we may want to grow a large tree first and prune it back to obtain a subtree1.
  • $\alpha$ is the price to pay for having a tree with many nodes, so minimising this criterion tends to select a smaller subtree. This is reminiscent of the lasso.

Cost-complexity pruning7¶

Using Scikit-learn, we can recursively fit a complex tree with no prior pruning and have a look at the effective alphas and the corresponding total leaf impurities at each step of the pruning process.

As alpha increases, more of the tree is pruned, thus creating a decision tree that generalizes better.

We can select the alpha that reduces the distance between the train and validation scores.

Notes

  • In Scikit-learn 0.22 the parameter ccp_alpha was introduced (short for cost-complexity pruning alpha)
  • DecisionTreeClassifier.cost_complexity_pruning_path returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process.

Then we can train a decision tree using the chosen effective alpha.
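A sketch of that workflow with `cost_complexity_pruning_path` and `ccp_alpha` (the dataset and split are placeholders, and alpha here is chosen by a simple validation-score maximum rather than the train/validation-distance rule above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Effective alphas (and total leaf impurities) at each pruning step.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

# Refit once per alpha and keep the alpha with the best validation score.
scores = [(DecisionTreeClassifier(random_state=0, ccp_alpha=a)
           .fit(X_train, y_train).score(X_val, y_val), a)
          for a in path.ccp_alphas if a >= 0]
best_alpha = max(scores)[1]
clf = DecisionTreeClassifier(random_state=0,
                             ccp_alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, clf.get_n_leaves())
```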

Other Algorithms¶

ID3 - Iterative Dichotomizer 38

  • Splitting Criterion: Maximises Information Gain/ Minimises Entropy
  • Number of Splits: Multiway
  • Variables: Discrete binary and multi-category features
  • Pruning: None

C4.59

  • Splitting Criterion: Maximises Information Gain/ Minimises Entropy
  • Number of Splits: Multiway
  • Variables: Discrete & Continuous (expensive)
  • Pruning: Post-Pruning
  • Can handle missing data
  • C5.0 is the latest release version under a proprietary license

Notes

  • You can read an introduction to ID3 in pgs. 27-30 of Burkov (2019)

Associated Exercises¶

Now might be a good time to try exercises 1-6.

References¶

  1. James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. An introduction to statistical learning. Vol. 112. New York: springer, 2013.
  2. Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
  3. Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.".
  4. Raschka, Sebastian, and Vahid Mirjalili. Python Machine Learning, 2nd Ed. Packt Publishing, 2017.
  5. https://github.com/rasbt/stat479-machine-learning-fs19/blob/master/06_trees/06-trees__notes.pdf
  6. Burkov, A. (2019). The hundred-page machine learning book (Vol. 1). Canada: Andriy Burkov.
  7. https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
  8. Quinlan, J. R. (1986). Induction of decision trees. Machine learning, 1 (1), 81-106
  9. Quinlan, J. R. (1993). C4. 5: Programming for machine learning. Morgan Kauffmann, 38, 48.
  10. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
  11. https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
  12. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)